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Abstract 

In this paper, we describe a method for 
automatic creation of a knowledge source 
for text generation using information ex- 
traction over the Internet. We present a 
prototype system called PROFILE which 
uses a client-server architecture to ex- 
tract noun-phrase descriptions of enti- 
ties such as people, places, and organiza- 
tions. The system serves two purposes: 
as an information extraction tool, it al- 
lows users to search for textual descrip- 
tions of entities; as a utility to generate 
functional descriptions (FD), it is used in 
a functional-unification based generation 
system. We present an evaluation of the 
approach and its applications to natural 
language generation and summarization. 



In the news domain, a summary needs to refer 
to people, places, and organizations and provide 
descriptions that clearly identify the entity for the 
reader. Such descriptions may not be present in 
the original text that is being summarized. For ex- 
ample, the American pilot Scott O'Grady, downed 
in Bosnia in June of 1995, was unheard of by the 
American public prior to the incident. If a reader 
tuned into news on this event days later, descrip- 
tions from the initial articles may be more useful. 
A summarizer that has access to different descrip- 
tions will be able to select the description that best 
suits both the reader and the series of articles be- 
ing summarized. 

In this paper, we describe a system called 
PROFILE that tracks prior references to a given 
entity by extracting descriptions for later use in 
summarization. In contrast with previous work on 
information extraction, our work has the following 
features: 



X 



1 Introduction 

In our work to date on news summarization at 
Columbia University ( McKeown and Radev, 199"^; 



Radev, 1996), information is extracted from a se- 



ries of input news articles (MUG, 1992; Grishman 



et al., 1992) and is analyzed by a generation com- 



ponent to produce a summary that shows how per- 
ception of the event has changed over time. In this 
summarization paradigm, problems arise when in- 
formation needed for the summary is either miss- 
ing from the input article(s) or not extracted by 
the information extraction system. In such cases, 
the information may be readily available in other 
current news stories, in past news, or in online 
databases. If the summarization system can find 
the needed information in other online sources, 
then it can produce an improved summary by 
merging information from multiple sources with 
information extracted from the input articles. 



• It builds a database of profiles for entities by 
storing descriptions from a collected corpus of 
past news. 

• It operates in real time, allowing for connec- 
tions with the latest breaking, online news to 
extract information about the most recently 
mentioned individuals and organizations. 

• It collects and merges information from dis- 
tributed sources thus allowing for a more com- 
plete record of information. 

• As it parses and identifies descriptions, it 
builds a lexicalized, syntactic representation 
of the description in a form suitable for in- 
put to the FUF/SURGE language generation 
system ( [Elhadad, 1993| ; [Robin, 1994| ). 



The result is a system that can combine de- 
scriptions from articles appearing only a few min- 



utes before the ones being summarized with de- 
scriptions from past news in a permanent record 



the use of regular grammars to deUmit and iden- 
tify proper nouns (Mani et al., 1993; Paik et al. 



for future use. Its utility lies in its potential for 1994), the use of extensive name lists, place names 



representing entities, present in one article, with 
descriptions found in other articles, possibly com- 
ing from another source. 

Since the system constructs a lexicalized, syn- 
tactic functional description (FD) from the ex- 
tracted description, the generator can re-use the 
description in new contexts, merging it with other 
descriptions, into a new grammatical sentence. 
This would not be possible if only canned strings 



were used, with no information about their inter- 



nal structure. Thus, in addition to collecting a 
knowledge source which provides identifying fea- 
tures of individuals, PROFILE also provides a lex- 
icon of domain appropriate phrases that can be in- 
tegrated with individual words from a generator's 
lexicon to flexibly produce summary wording. 

We have extended the system by semantically 
categorizing descriptions using WordNet ( Miller et 
al., 1990), so that a generator can more easily de- 
termine which description is relevant in different 
contexts. 

PROFILE can also be used in a real-time fash- 
ion to monitor entities and the changes of descrip- 
tions associated with them over the course of time. 

In the following sections, we first overview re- 
lated work in the area of information extraction. 
We then turn to a discussion of the system com- 
ponents which build the profile database, followed 
by a description of how the results are used in gen- 
eration. We close with our current directions, de- 
scribing what parameters can influence a strategy 
for generating a sequence of anaphoric references 
to the same entity over time. 



B — Related Work 



Research related to ours falls into two main cate- 
gories: extraction of information from input text 
and construction of knowledge sources for genera- 
tion. 

2.1 Information Extraction 

Work on information extraction is quite broad and 
covers far more topics and problems than the in- 
formation extraction problem we address. We 
restrict our comparison here to work on proper 
noun extraction, extraction of people descriptions 
in various information extraction systems devel- 
oped for the message understanding conferences 
( MUG, 1992 ), and use of extracted information 
for question answering. 

Techniques for proper noun extraction include 



titles and "gazetteers" in conjunction with par- 
tial grammars in order to recognize proper nouns 
as unknown words in close proximity to known 
words dCowie et al, 1992t [Aberdeen et al., 1992| ), 
statistical training to learn, for example, Spanish 
names, from online corpora (Ayuso et al., 1992), 
and the use of concept based pattern matchers 
that use semantic concepts as patter n categories 
as well as p a rt-of-speech informati on (Weischedel 
et al., 1993| ; Lehnert et al., 1993 ). In addition. 



some researchers have explored the use of both lo- 
cal con text surrounding t he hypothesized prope r 
nouns ( McDonald, 1993 ; Coates-Stephcns, 1991 ) 
and the larger discourse context ( Mani et al., 1993 ) 
to improve the accuracy of proper noun extrac- 
tion when large known word lists are not available. 
Like this research, our work also aims at extract- 
ing proper nouns without the aid of large word 
lists. We use a regular grammar encoding part-of- 
speech categories to extract certain text patterns 
and we use WordNet (Miller et al., 199C) to pro- 
vide semantic filtering. 

Our work on extracting descriptions is quite 
similar to the work carried out under the DARPA 
message understanding program for extracting de- 
scriptions (MUG, 1992). The purpose for and the 
scenario in which description extraction is done is 
quite different, but the techniques are very simi- 
lar. It is based on the paradigm of representing 
patterns that express the kinds of descriptions we 
expect; unlike previous work we do not encode se- 
mantic categories in the patterns since we want to 
capture all descriptions regardless of domain. 



Research on a system called Murax ( [Kupiec 



1993) is similar to ours from a different perspec- 
tive. Murax also extracts information from a text 
to serve directly in response to a user question. 
Murax uses lexico-syntactic patterns, collocational 
analysis, along with information retrieval statis- 
tics, to find the string of words in a text that is 
most likely to serve as an answer to a user's wh- 
query. In our work, the string that is extracted 
may be merged, or regenerated, as part of a larger 
textual summary. 

2.2 Construction of Knowledge Sources 
for Generation 

The construction of a database of phrases for re- 
use in generation is quite novel. Previous work 
on extraction of collocations for use in genera- 
tion (3madja and McKeown, 1991) is related in 



that full phrases are extracted and syntactically 
typed so that they can be merged with individual 
words in a generation lexicon to produce a full sen- 
tence. However, extracted collocations were used 
only to determine realization of an input concept. 
In our work, stored phrases would be used to pro- 
vide content that can identify a person or place 
for a reader, in addition to providing the actual 
phrasing. 

3 Creation of a Database of 
Profiles 

Figure |] shows the overall architecture of PRO- 
FILE and the two interfaces to it (a user interface 
on the World-Wide Web and an interface to a nat- 
ural language generation system). In this section, 
we describe the extraction component of PRO- 
FILE, the following section focuses on the uses of 
PROFILE for generation, and Section ^ describes 
the Web-based interface. 




Description Categorization 



Description Storage 



Web Interface 



PROFILE 



Conversion to FDs 



Surface Generation 



Figure 1: Overall Architecture of PROFILE. 



3.1 Extraction of entity names from old 
newswire 

To seed the database with an initial set of descrip- 
tions, we used a 1.4 MB corpus containing Reuters 
newswire from March to June of 1995. The pur- 
pose of such an initial set of descriptions is twofold. 
First, it allows us to test the other components of 
the system. Furthermore, at the time a descrip- 
tion is needed it limits the amount of online full 
text, Web search that must be done. At this stage, 
search is limited to the database of retrieved de- 
scriptions only, thus reducing search time as no 



connections will be made to external news sources 
at the time of the query. Only when a suitable 
stored description cannot be found will the sys- 
tem initiate search of additional text. 

• Extraction of candidates for proper 
nouns. After tagging the c orpus using th e 
POS part-of-speech tagger (Church, 1988 ), 
we used a CREP (|Duford, 1993| ) regular 
grammar to first extract all possible candi- 
dates for entities. These consist of all se- 
quences of words that were tagged as proper 
nouns (NP) by POS. Our manual analysis 
showed that out of a total of 2150 entities 
recovered in this way, 1139 (52.9%) are not 
names of entities. Among these are n-grams 
such as "Prime Minister" or "Egyptian Pres- 
ident" which were tagged as NP by POS. Ta- 
ble H shows how many entities we retrieve 
at this stage, and of them, how many pass 
the semantic filtering test. The numbers in 
the left-hand column refer to two-word noun 
phrases that identify entities (e.g., "Bill Clin- 
ton" ) . Counts for three- word noun phrases 
are shown in the right-hand column. We show 
counts for multiple and unique occurrences of 
the same noun phrase. 

• Weeding out of false candidates. Our 

system analyzed all candidates for entity 
names using WordNet ( Miller et al., 199C| ) 
and removed from consideration those that 
contain words appearing in WordNet 's dictio- 
nary. This resulted in a list of 421 unique 
entity names that we used for the automatic 
description extraction stage. All 421 entity 
names retrieved by the system are indeed 
proper nouns. 

3.2 Extraction of descriptions 

There are two occasions on which we extract de- 
scriptions using finite-state techniques. The first 
case is when the entity that we want to describe 
was already extracted automatically (see Subsec- 
tion 3J) and exists in PROFILE'S database. The 



second case is when we want a description to be re- 
trieved in real time based on a request from either 
a Web user or the generation system. 

There exist many live sources of newswire on 
the Internet that can be used for this second case. 
Some that merit our attention are the ones that 
can be accessed remotely through small client pro- 
grams that don't require any sophisticated proto- 
cols to access the newswire articles. Such sources 
include HTTP-accessible sites such as the Reuters 



stage 


Two-word entities 


Three-word entities | 


Entities 


Unique Entities 


Entities 


Unique Entities 


POS tagging only 
After WordNet checkup 


9079 
1509 


1546 
395 


2617 
81 


604 
26 



Table 1: Two- word and three- word entities retrieved by the system. 



site at www. yahoo. com and CNN Interactive at 
www.cnn.com, as well as others such as ClariNet 
which is propagated through the NNTP protocol. 
All these sources share a common characteristic 
in that they are all updated in real time and all 
contain information about current events. Hence, 
they are therefore likely to satisfy the criteria of 
pertinence to our task, such as the likelihood of the 
sudden appearance of new entities that couldn't 
possibly have been included a priori in the gener- 
ation lexicon. 

Our system generates finite-state representa- 
tions of the entities that need to be described. An 
example of a finite-state description of the entity 
"Yasser Arafat" is shown in Figure g. These full 
expressions are used as input to the description 
finding module which uses them to find candidate 
sentences in the corpus for finding descriptions. 
Since the need for a description may arise at a 
later time than when the entity was found and 
may require searching new text, the description 
finder must first locate these expressions in the 
text. 

These representations are fed to CREP which 
extracts noun phrases on either side of the en- 
tity (either pre-modifiers or appositions) from the 
news corpus. The finite-state grammar for noun 
phrases that we use represents a variety of differ- 
ent syntactic structures for both pre-modificrs and 
appositions. Thus, they may range from a simple 
noun (e.g., "president Bill Clinton") to a much 
longer expression (e.g., "Gilberto Rodriguez Ore- 
juela, the head of the Cali cocaine cartel"). Other 
forms of descriptions, such as relative clauses, are 
the focus of ongoing implementation. 

Table g shows some of the different patterns 
retrieved. 

3.3 Categorization of descriptions 

We use WordNet to group extracted descriptions 
into categories. For all words in the description, 
we try to find a WordNet hypernym that can re- 
strict the semantics of the description. Currently, 
we identify concepts such as "profession" , "nation- 
ality" , and "organization" . Each of these concepts 
is triggered by one or more words (which we call 
"triggers") in the description. Table || shows some 



examples of descriptions and the concepts under 
which they are classified based on the WordNet hy- 
pernyms for some "trigger" words. For example, 
all of the following "triggers" in the list "minister" , 
"head" , "administrator" , and "commissioner" can 
be traced up to "leader" in the WordNet hierarchy. 

3.4 Organization of descriptions in a 
database of profiles 

For each retrieved entity we create a new profile 
in a database of profiles. We keep information 
about the surface string that is used to describe 
the entity in newswire (e.g., "Addis Ababa"), 
the source of the description and the date that 
the entry has been made in the database (e.g., 
"reuters95_06_25"). In addition to these pieces 
of meta-information, all retrieved descriptions and 
their frequencies are also stored. 

Currently, our system doesn't have the capa- 
bility of matching references to the same entity 
that use different wordings. As a result, we keep 
separate profiles for each of the following: "Robert 
Dole" , "Dole" , and "Bob Dole" . We use each of 
these strings as the key in the database of descrip- 
tions. 

Figure shows the profile associated with the 
key "John Major" . 



KEY: John major 

SOURCE: reuters95_03-06_.nws 

DESCRIPTION: british prime minister 

FREQUENCY: 75 

DESCRIPTION: prime minister 

FREQUENCY: 58 

DESCRIPTION: a defiant british prime minister 

FREQUENCY: 2 

DESCRIPTION: his british counterpart 

FREQUENCY: 1 



Figure 3: Profile for John Major. 

The database of profiles is updated every time 
a query retrieves new descriptions matching a cer- 
tain key. 

4 Generation 

We have made an attempt to reuse the descrip- 
tions, retrieved by the system, in more than a triv- 
ial way. The content planner of a language gener- 



SEARCH_STRING = ( ({NDUN_PHRASE}{SPACE})+{SEARCH_0}) I ({SEARCH_0}{SPACE}{COMMA}{SPACE}{NDUN_PHRASE}) 
SEARCH_109 = [Yy]asser{T_NDUN}{SPACE}[Aa]rafat{T_NOUN} 
SEARCH.O = {SEARCH.l} I {SEARCH_2} I ... I {SEARCH_109} I . . . 



Figure 2: Finite-state representation of "Yasser Arafat'' 



Example 


Trigger Term 


Semantic Category 


Addis Ababa, the Ethiopian capital 


capital 


location 


South Africa's main black opposition leader, Mangosuthu Buthelezi 


leader 


occupation 


Boerge Ousland, 33 


33 


age 


maverick French ex-soccer boss Bernard Tapie 


boss 


occupation 


Italy's former prime minister, Silvio Berlusconi 


minister 


occupation 


Sinn Fein, the political arm of the Irish Republican Army 


arm 


organization 



Table 2: Examples of retrieved descriptions. 



MILAN - A judge ordered Italy's former prime 
minister Silvio Berlusconi to stand trial in Jan- 
uary on corruption charges in a ruling that could 
destroy the media magnate's hope of returning to 
high office. 

Figure 4: Source article. 



ation system that needs to present an entity to the 
user that he has not seen previously, might want 
to include some background information about it. 
However, in case the extracted information doesn't 
contain a handy description, the system can use 
some descriptions retrieved by PROFILE. 

4.1 Transformation of descriptions into 
Functional Descriptions 

Since our major goal in extracting descriptions 
from on-line corpora was to use them in gener- 
ation, we have written a utility which converts 
ffiiite-state descriptions retrieved by PROFILE 
into functional descriptions (FD) ( Elhadad, 1991 ) 
that we can use directly in generation. A descrip- 
tion retrieved by the system from the article in ^ 
is shown in Figure o. The corresponding FD is 
shown in Figure @. 



Italy@NPNP ' s®$ former@JJ primeOJJ 
ministerONN SilvioONPNP BerlusconiONPNP 



Figure 5: Retrieved description for Silvio Berlus- 
coni. 

We have implemented a TCP/IP interface to 
Surge. The FD generation component uses this 
interface to send a new FD to the surface realiza- 
tion component of Surge which generates an En- 
glish surface form corresponding to it. 



C (cat np) 
(complex apposition) 
(restrictive no) 
(distinct ~(((cat common) 

(possessor ((cat common) 

(determiner none) 
(lex ' 'Italy"))) 
(classifier ((cat noun-compound) 

(classifier ((lex ''former''))) 
(head ((lex ''prime''))))) 
(head ((lex ''minister'')))) 
((cat person-name) 
(first-name ((lex ''Silvio''))) 
(last -name ((lex ''Berlusconi'')))))))) 



Figure 6: Generated FD for Silvio Berlusconi. 

4.2 Lexicon creation 

We have identified several major advantages of 
using FDs produced by the system in generation 
compared to using canned phrases. 

• Grammaticality. The deeper representa- 
tion allows for grammatical transformations, 
such as aggregation: e.g., "president Yeltsin" 
+ "president Clinton" can be generated as 
"presidents Yeltsin and Clinton" . 

• Unification with existing ontologies. 

E.g., if an ontology contains information 
about the word "president" as being a realiza- 
tion of the concept "head of state" , then un- 
der certain conditions, the description can be 
replaced by one referring to "head of state" . 

• Generation of referring expressions. In 

the previous example, if "president Bill Clin- 
ton" is used in a sentence, then "head of 
state" can be used as a referring expression 
in a subsequent sentence. 

• Enhancement of descriptions. If we have 
retrieved "prime minister" as a description for 
Silvio Berlusconi, and later we obtain knowl- 
edge that someone else has become Italy's 



primer minister, then we can generate "for- 
mer prime minister" using a transformation 
of the old FD. 

• Lexical choice. When different descrip- 
tions are automatically marked for semantics, 
PROFILE can prefer to generate one over an- 
other based on semantic features. This is 
useful if a summary discusses events related 
to one description associated with the entity 
more than the others. 

• Merging lexicons. The lexicon generated 
automatically by the system can be merged 
with a domain lexicon generated manually. 

These advantages look very promising and we 
will be exploring them in detail in our work on 
summarization in the near future. 

5 Coverage and Limitations 

In this section we provide an analysis of the capa- 
bilities and current limitations of PROFILE. 

5.1 Coverage 

At the current stage of implementation, PROFILE 
has the following coverage. 

• Syntactic coverage. Currently, the sys- 
tem includes an extensive finite-state gram- 
mar that can handle various pre-modifiers 
and appositions. The grammar matches arbi- 
trary noun phrases in each of these two cases 
to the extent that the POS part-of-speech tag- 
ger provides a correct tagging. 



Precision. In Subsection 3.1 we showed the 



precision of the extraction of entity names. 
Similarly, we have computed the precision of 
retrieved 611 descriptions using randomly se- 
lected entities from the list retrieved in Sub- 
section 3.1, Of the 611 descriptions, 551 
(90.2%) were correct. The others included 
a roughly equal number of cases of incorrect 
NP attachment and incorrect part-of-speech 
assignment. For our task (symbolic text gen- 
eration), precision is more important than re- 
call; it is critical that the extracted descrip- 
tions are correct in order to be converted to 
FD and generated. 

• Length of descriptions. The longest de- 
scription retrieved by the system was 9 lexical 
items long: "Maurizio Gucci, the former head 
of Italy's Gucci fashion dynasty" . The short- 
est descriptions are 1 lexical item in length - 
e.g. "President Bill Chnton" . 



• Protocol coverage. We have implemented 
retrieval facilities to extract descriptions us- 
ing the NNTP (Usenet News) and HTTP 
(World-Wide Web) protocols. 

5.2 Limitations 

Our system currently doesn't handle entity cross- 
referencing. It will not realize that "Clinton" and 
"Bill Clinton" refer to the same person. Nor will 
it link a person's profile with the profile of the 
organization of which he is a member. 

At this stage, the system generates functional 
descriptions (FD), but they are not being used in 
a summarization system yet. 

6 Current Directions 

One of the more important current goals is to 
increase coverage of the system by providing in- 
terfaces to a large number of on-line sources of 
news. We would ideally want to build a compre- 
hensive and shareable database of profiles that can 
be queried over the World-Wide Web. 

We need to refine the algorithm to handle cases 
that are currently problematic. For example, pol- 
ysemy is not properly handled. For instance, we 
would not label properly noun phrases such as 
"Rice University", as it contains the word "rice" 
which can be categorized as a food. 

Another long-term goal of our research is the 
generation of evolving summaries that continu- 
ously update the user on a given topic of inter- 
est. In that case, the system will have a model 
containing all prior interaction with the user. To 
avoid repetitiveness, such a system will have to re- 
sort to using different descriptions (as well as refer- 
ring expressions) to address a specific entityR. We 
will be investigating an algorithm that will select a 
proper ordering of multiple descriptions referring 
to the same person. 

After we collect a series of descriptions for each 
possible entity, we need to decide how to select 
among all of them. There are two scenarios. In 
the first one, we have to pick one single descrip- 
tion from the database which best fits the sum- 
mary that we are generating. In the second sce- 
nario, the evolving summary, we have to generate 
a sequence of descriptions, which might possibly 
view the entity from different perspectives. We 



^Our corpus analysis supports this proposition - 
a large number of threads of summaries on the same 
topic from the Reuters and UPI newswire used up to 
10 different referring expressions (mostly of the type 
of descriptions discussed in this paper) to refer to the 
same entity. 



are investigating algorithms that will decide the 
order of generation of the different descriptions. 
Among the factors that will influence the selec- 
tion and ordering of descriptions, we can note the 
user's interests, his knowledge of the entity, the fo- 
cus of the summary (e.g., "democratic presidential 
candidate" for Bill Clinton, vs. "U.S. president"). 
We can also select one description over another 
based on how recent they have been included in 
the database, whether or not one of them has been 
used in a summary already, whether the summary 
is an update to an earlier summary, and whether 
another description from the same category has 
been used already. We have yet to decide under 
what circumstances a description needs to be gen- 
erated at all. 

We are interested in implementing existing al- 
gorithms or designing our own that will match dif- 
ferent instances of the same entity appearing in 
different syntactic forms - e.g., to establish that 
"PLO" is an alias for the "Palestine Liberation 
Organization" . We will investigate using cooccur- 
rence information to match acronyms to full or- 
ganization names and alternative spellings of the 
same name with each other. 

An important application that we are consid- 
ering is applying the technology to text available 
using other protocols - such as SMTP (for elec- 
tronic mail) and retrieve descriptions for entities 
mentioned in such messages. 

We will also look into connecting the current 
interface with news available to the Internet with 
an existing search engine such as Lycos (www.- 
lycos.com) or AltaVista (www. altavista. digital. - 
com). We can then use the existing indices of 
all Web documents mentioning a given entity as 
a news corpus on which to perform the extraction 
of descriptions. 

Fina lly, we will inve stigate the creation of 
KQML dPininet al., 1994| ) interfaces to the differ- 
ent components of PROFILE which will be linked 
to other information access modules at Columbia 
University. 

7 Contributions 

We have described a system that allows users to 
retrieve descriptions of entities using a Web-based 
search engine. Figure shows the Web interface 
to PROFILE. Users can select an entity (such as 
"John Major"), specify what semantic classes of 
descriptions they want to retrieve (e.g., age, posi- 
tion, nationality) as well as the maximal number 
of queries that they want. They can also specify 
which sources of news should be searched. Cur- 



rently, the system has an interface to Reuters at 
www.yahoo.com. The CNN Web site, and to all 
local news delivered via NNTP to our local news 
domain. 



HBBaSiBSa 



PROFILE 



Welcome to PEOFILE Please f< 




Figure 7: Web-based interface to PROFILE. 

The Web-based interface is accessible publicly 
(currently within Columbia University only). All 
queries are cached and the descriptions retrieved 
can be reused in a subsequent query. We believe 
that such an approach to information extraction 
can be classified as a collaborative database. 

The FD generation component produces syn- 
tactically correct functional descriptions that can 
be used to generate English-language descriptions 
using FUF and Surge, and can also be used in a 
general-purpose summarization system in the do- 
main of current news. 

All components of the system assume no prior 
domain knowledge and are therefore portable to 
many domains - such as sports, entertainment, 
and business. 
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